Locality-Sensitive Hashing for Protein Classification

نویسندگان

  • Lawrence Buckingham
  • James M. Hogan
  • Shlomo Geva
  • Wayne Kelly
  • Richi Nayak
  • Xue Li
  • Lin Liu
  • Kok-Leong Ong
  • Yanchang Zhao
  • Paul Kennedy
چکیده

Determination of sequence similarity is a central issue in computational biology, a problem addressed primarily through BLAST, an alignment based heuristic which has underpinned much of the analysis and annotation of the genomic era. Despite their success, alignment-based approaches scale poorly with increasing data set size, and are not robust under structural sequence rearrangements. Successive waves of innovation in sequencing technologies – so-called Next Generation Sequencing (NGS) approaches – have led to an explosion in data availability, challenging existing methods and motivating novel approaches to sequence representation and similarity scoring, including adaptation of existing methods from other domains such as information retrieval. In this work, we investigate locality-sensitive hashing of sequences through binary document signatures, applying the method to a bacterial protein classification task. Here, the goal is to predict the gene family to which a given query protein belongs. Experiments carried out on a pair of small but biologically realistic datasets (the full protein repertoires of families of Chlamydia and Staphylococcus aureus genomes respectively) show that a measure of similarity obtained by locality sensitive hashing gives highly accurate results while offering a number of avenues which will lead to substantial performance improvements over BLAST. .

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Distance-sensitive hashing

We initiate the study of distance-sensitive hashing, a generalization of locality-sensitive hashing that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space (X,dist) and a “collision probability function” (CPF) f : R→ [0, 1] we seek a distribution over pairs o...

متن کامل

Locality-sensitive hashing and biological network alignment

Large biological networks contain much information about the functionality of protein-protein interactions and other macromolecular relationships. There are certain small subsets of these graphs which display similar characteristics that appear more frequently than others; these motifs can be used to uncover key information about the structure of the networks. Existing algorithms use exact iden...

متن کامل

Fast Image Search with Locality-Sensitive Hashing and Homogeneous Kernels Map

Fast image search with efficient additive kernels and kernel locality-sensitive hashing has been proposed. As to hold the kernel functions, recent work has probed methods to create locality-sensitive hashing, which guarantee our approach's linear time; however existing methods still do not solve the problem of locality-sensitive hashing (LSH) algorithm and indirectly sacrifice the loss in accur...

متن کامل

Multi-Level Spherical Locality Sensitive Hashing For Approximate Near Neighbors

This paper introduces “Multi-Level Spherical LSH”: parameter-free, a multi-level, data-dependant Locality Sensitive Hashing data structure for solving the Approximate Near Neighbors Problem (ANN). This data structure is a modified version multi-probe adaptive querying algorithm, with the potential of achieving a O(np + t) query run time, for all inputs n where t <= n. Keywords—Locality Sensitiv...

متن کامل

Locality-Sensitive Hashing with Margin Based Feature Selection

We propose a learning method with feature selection for Locality-Sensitive Hashing. Locality-Sensitive Hashing converts feature vectors into bit arrays. These bit arrays can be used to perform similarity searches and personal authentication. The proposed method uses bit arrays longer than those used in the end for similarity and other searches and by learning selects the bits that will be used....

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014